Imagine we’re some fancy data scientists exploring - once again - the gapminder data. We’re particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS recommended using this tidyverse thing to complete our tasks. At the same time, they also hesitate to load all of its R-packages at once.

1

Load all packages from the tidyverse for importing Excel data and for data wrangling.
You can find them in the slides from A3 and A5.
library(readxl)
library(dplyr)
library(tidyr)

Ok, that wasn’t too hard. But data science is about data, so we have to load in the data.

2

Import the GDP gapminder data. Make sure to only import the excel sheet named “Data”.
Individual sheets can be chosen by applying the option sheet = "name_of_your_sheet"
gapminder_example <- 
  readxl::read_excel(
    path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
    sheet = "Data"
    )

Have the data been successfully imported? They should comprise a tibble of 275 x 53. Furthermore, the income per person for Algeria of the years 1960, 1961, and 1962 should be 1280, 1085, and 856.

3

Proof that the income per person for Algeria of the years 1960, 1961, and 1962 are 1280, 1085, and 856
Algeria is in the 5th row of the dataset, and the relevant variables are in the first four columns. You can subset datasets also by selecting rows by number with select() and by filtering by number with slice().
gapminder_example %>% 
  select(1:4) %>% 
  slice(n = 5)
## # A tibble: 1 x 4
##   `Income per person (fixed 2000 US$)` `1960.0` `1961` `1962`
##   <chr>                                   <dbl>  <dbl>  <dbl>
## 1 Algeria                                 1280.  1085.   856.

Let’s say we’re interested in the earliest 10 years of development in all countries and in the most recent 10 years. The idea is that there might be some differences between the early days and the new days of GDP development. At first, we’d like to compute such statistics across all countries. Unfortunately, the data are in the wide format.

4

Re-arrange the data such that they are in the long format.
Remember that the command for converting wide format data to long format is gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() as its really messy.
gapminder_example %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP")
## # A tibble: 14,300 x 3
##    country               year     GDP
##    <chr>                 <chr>  <dbl>
##  1 Abkhazia              1960.0   NA 
##  2 Afghanistan           1960.0   NA 
##  3 Akrotiri and Dhekelia 1960.0   NA 
##  4 Albania               1960.0   NA 
##  5 Algeria               1960.0 1280.
##  6 American Samoa        1960.0   NA 
##  7 Andorra               1960.0   NA 
##  8 Angola                1960.0   NA 
##  9 Anguilla              1960.0   NA 
## 10 Antigua and Barbuda   1960.0   NA 
## # … with 14,290 more rows

Ok, did it work out? There are still a lot of missing values we might get rid of, and the data are not arranged properly. They make the data untidy, distract us and are not part of any mean calculations anyway. For the next upcoming tasks, simply re-use your code and add the following commands with the %>%.

5

Remove all missing values and arrange the data in ascending years and the GDP.
There are several ways to exclude missing values. The most convenient one is to use filter() in combination with !is.na.
gapminder_example %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP)
## # A tibble: 7,988 x 3
##    country      year     GDP
##    <chr>        <chr>  <dbl>
##  1 Malawi       1960.0  98.6
##  2 China        1960.0 105. 
##  3 Burundi      1960.0 116. 
##  4 Burkina Faso 1960.0 122. 
##  5 Lesotho      1960.0 132. 
##  6 Nepal        1960.0 139. 
##  7 Togo         1960.0 177. 
##  8 India        1960.0 181. 
##  9 Pakistan     1960.0 187. 
## 10 Indonesia    1960.0 201. 
## # … with 7,978 more rows

Nice. Now we got a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we got the values for all years between these two periods in the data. But we decided that we leave them there for some future analyses. As such, we do all analyses on the fly. Let’s start with the first period.

6

Calculate the mean value of GDP across all countries for each of the first ten years.
As the year variable is a double, you can simply filter the range of years you are interested in.
gapminder_example %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP) %>% 
  filter(year >= 1960 & year <= 1969) %>% 
  group_by(year) %>% 
  summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
##    year   GDP_over_all_countries
##    <chr>                   <dbl>
##  1 1960.0                  2863.
##  2 1961                    2936.
##  3 1962                    3034.
##  4 1963                    3127.
##  5 1964                    3297.
##  6 1965                    3479.
##  7 1966                    3562.
##  8 1967                    3644.
##  9 1968                    3827.
## 10 1969                    4015.

After this was done, you might know how to do that for the 10 most recent years…

7

Calculate the mean value of GDP across all countries for each of the last ten years.
gapminder_example %>% 
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  gather(-country, key = "year", value = "GDP") %>% 
  filter(!is.na(GDP)) %>% 
  arrange(year, GDP) %>% 
  filter(year >= 2002 & year <= 2011) %>% 
  group_by(year) %>% 
  summarise(GDP_over_all_countries = mean(GDP))
## # A tibble: 10 x 2
##    year  GDP_over_all_countries
##    <chr>                  <dbl>
##  1 2002                   7983.
##  2 2003                   8113.
##  3 2004                   8335.
##  4 2005                   8545.
##  5 2006                   8899.
##  6 2007                   9219.
##  7 2008                   8999.
##  8 2009                   8463.
##  9 2010                   7700.
## 10 2011                   7603.